HW#1 --- last modified February 06 2019 04:15:52.

Solution set.

Due date: Feb 22

Files to be submitted:
  Hw1.zip

Purpose: To gain some experience with language modeling, to dust off your programming skills, and to start work on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework has three parts: some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will specify the file names where I want you to put your work. After you have finished the homework, take all the files you have created, put them into a folder Hw1, zip it to make a file Hw1.zip, and submit that.

The first file I want you to make is the plain text file problems.txt. In it put your answers to the following questions:

  1. Open Source Shakespeare has stats about Shakespeare's works. For example, there are 884,421 total words in Shakespeare's 43 works. Using this site you can calculate the maximum likelihood estimate for any term in Shakespeare. (a) Compute the MLE for each of the terms: (i) a, (ii) before, (iii) dagger, (iv) I, (v) is, (vi) me, (vii) see, (viii) this. Then (b) using these, compute the likelihood of the phrase: Is this a dagger I see before me? (The relevant formulas are sketched after this list.)
  2. Using the table of the top 15 most frequent words in Shakespeare from the Open Source Shakespeare site, and assuming Zipf's law, try to determine constants `\alpha` and `\beta` such that the frequency of the `i`th word in this list of 15 is as close as possible to `f(i) = \beta/i^\alpha`. Note that taking the log of both sides gives: `\log f(i) = -\alpha\cdot(\log i) + \log \beta`. As `\beta` does not depend on `i`, we can view `\log \beta` as a constant, so this is just a line-fitting problem. Comment on how well Zipf's law fits this data. (A PHP sketch of the line fitting appears after this list.)
  3. Using the Open Source Shakespeare site, one could imagine also being able to compute a Markov model for the works of Shakespeare. By using advanced search and regular expressions, compute the transition arcs and their probabilities (a) into and (b) out from the word "ransom" for the Shakespeare Markov model. Here we imagine "ransom" is an arc in the Markov model, and we pretend the nodes it is connected to and from only have incoming edges based on the words before and after "ransom". I am not asking you to use an algorithm like Baum-Welch; just make a crude estimate based on the frequencies of the individual incoming and outgoing words versus the total frequency into and out of "ransom".
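
As a sketch of the definitions involved: the maximum likelihood estimate for a term `t` is `P_{mle}(t) = count(t)/N`, where `N = 884421` is the total word count above. Under the unigram independence assumption, one standard way to do (b) of problem 1 is to take the product of the per-term MLEs, `P(w_1 w_2 \cdots w_n) = \prod_{j=1}^{n} P_{mle}(w_j)`. For problem 3, a crude estimate of a transition probability out of "ransom" is `P(ransom \rightarrow w) \approx count(ransom\ w)/count(ransom)`, and arcs into "ransom" can be estimated symmetrically.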
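
For problem 2, the line fitting can be done with ordinary least squares on the log-log data. Below is a minimal PHP sketch; the frequency values are placeholders, not real counts, so substitute the actual top-15 frequencies from Open Source Shakespeare:

<?php
// Least-squares fit of log f(i) = -alpha * log(i) + log beta.
// The counts below are placeholders; replace them with the actual
// frequencies of the top 15 words from Open Source Shakespeare.
$freqs = [28000, 26000, 21000, 20000, 17000, 16000, 15000, 14000,
    13000, 12500, 12000, 11500, 11000, 10500, 10000];
$n = count($freqs);
$sumX = $sumY = $sumXX = $sumXY = 0;
for ($i = 1; $i <= $n; $i++) {
    $x = log($i);
    $y = log($freqs[$i - 1]);
    $sumX += $x;
    $sumY += $y;
    $sumXX += $x * $x;
    $sumXY += $x * $y;
}
$slope = ($n * $sumXY - $sumX * $sumY) / ($n * $sumXX - $sumX * $sumX);
$alpha = -$slope;
$beta = exp(($sumY - $slope * $sumX) / $n);
echo "alpha = $alpha, beta = $beta\n";
// To comment on fit quality, you might compare each f(i) with
// beta / pow(i, alpha), or compute the correlation of the log-log data.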

For the second part of the homework, I want you to download and install Yioop as described in class. I want you to conduct an experiment to see how well Yioop's page summarizer is working. Pick five Wikipedia pages of your choice. If possible, choose one of the pages to be in a language other than English in which you are fluent. For each page, come up with a list of five queries that that page might be able to answer. Make sure your queries only involve words that appear on the given page. Next, I want you to create by hand a short summary (say 5 sentences) for each of your Wikipedia pages. Your summaries should be in the same language as those pages. If you are working in a group of more than one person, have a different person make the queries from the one who makes the summary. Then, for each of your summaries, look at the list of five queries for the page the summary came from. Determine which queries have all their terms still in the summary, and make a count of this for each page. For example, you might find wiki page 1 had 3 out of 5 queries completely contained in the summary.

Log in to the root account of your copy of Yioop and go to the Page Options activity and then the Test Options tab. For each Wikipedia page you chose, copy and paste its HTML source into the Test Options text area and click Test Process Page. Look at the output and determine the number of times Yioop got the human language of the Wikipedia page correct. Then take the summaries Yioop produces for each Wikipedia page you chose and calculate, as you did for your own human summaries, the number of queries for each page whose terms are all still contained in the summary. Write up all of these experiments, along with a paragraph of conclusions, in the file summaries.txt.
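
If you want to automate the containment count rather than check it by eye, here is a minimal PHP sketch. The $summary and $queries values are hypothetical placeholders for your own data:

<?php
// Counts how many queries have all their terms contained in a summary.
// Splits on runs of non-letter, non-digit characters so it also works
// for non-English text (requires the mbstring extension).
function allTermsInSummary($query, $summary)
{
    $split = '/[^\p{L}\p{N}]+/u';
    $summaryTerms = array_flip(preg_split($split,
        mb_strtolower($summary), -1, PREG_SPLIT_NO_EMPTY));
    $queryTerms = preg_split($split, mb_strtolower($query), -1,
        PREG_SPLIT_NO_EMPTY);
    foreach ($queryTerms as $term) {
        if (!isset($summaryTerms[$term])) {
            return false;
        }
    }
    return true;
}
$summary = "Paste one of your five-sentence summaries here.";
$queries = ["first sample query", "second sample query"];
$contained = 0;
foreach ($queries as $query) {
    if (allTermsInSummary($query, $summary)) {
        $contained++;
    }
}
echo "$contained out of " . count($queries) . " queries contained\n";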

For the coding part of the homework, I want you to create a PHP program that computes a simple text-based, positional inverted index. Your program will be run from the command line with a line of the format:

php InvertedIndex.php some_dir
Here some_dir is the path to some directory in your file system. I will test your code using PHP 5.6.10. PHP is reasonably backward compatible, so as long as you don't use any really new features, if it works on your machine it should work on mine. Your code should follow the coding standards in PSR-1 and PSR-2.

Your code, when run with a line like the above, should glob an array of all .txt files in some_dir. Using glob without the GLOB_NOSORT flag should mean this array will be sorted alphabetically. For example, we might get an array ["a.txt", "b.txt", "c.txt"]. Define the document id of a file name to be its position in this array; so c.txt has document id 2. Your program should then read in each text file in turn and use them to create a text-based, positional index, which it outputs to stdout. There should be one index entry for each term found in any of these text files, and you should output these entries in ascending alphabetical order. An index entry should look like:
term:num_docs:freq:(doc_id1,doc_id1_pos1,doc_id1_pos2,...,doc_id1_pos_m_id1),...more records, if they exist

For example,

apple:3:6:(1,1,6),(5,2,4,5),(23,87)

In this example, "apple" is the term. It occurs in 3 documents and the total number of times it occurs is 6. This is then followed by a sorted by document id list of three records. The record (1,1,6) indicates the word apple appeared in the document with id 1 as the 1st and 6th token in that document. The record (5,2,4,5) indicates that the word apple appeared in document 5 at locations 2, 4, and 5. Finally, (23,87) indicates that apple appeared in document 23 at location 87.

Your program should output a sorted sequence of lines, each of which is an index entry. For example, it might begin with lines like:

apple:3:6:(1,1,6),(5,2,4,5),(23,87)
banana:2:6:(2,3,6,9),(4,7,18,20)
cucumber:5:5:(0,16),(2,16),(3,16),(4,16),(5,16)
...
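
To make the expected flow concrete, here is a minimal sketch of one way such a program might be organized. It assumes a simple \W+ tokenization and 1-based token positions, as in the apple example; it is a starting point, not a complete or definitive solution:

<?php
// Sketch: build a positional inverted index from the .txt files in a
// directory and print it in term:num_docs:freq:(records) format.
$dir = $argv[1];
$files = glob("$dir/*.txt"); // sorted alphabetically (no GLOB_NOSORT)
$index = [];
foreach ($files as $doc_id => $file) {
    $tokens = preg_split('/\W+/', strtolower(file_get_contents($file)),
        -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $pos => $token) {
        $index[$token][$doc_id][] = $pos + 1; // positions count from 1
    }
}
ksort($index, SORT_STRING); // terms in ascending alphabetical order
foreach ($index as $term => $postings) {
    $freq = 0;
    $records = [];
    // Postings are already in ascending document id order because the
    // files were processed in document id order.
    foreach ($postings as $doc_id => $positions) {
        $freq += count($positions);
        $records[] = "(" . implode(",",
            array_merge([$doc_id], $positions)) . ")";
    }
    echo "$term:" . count($postings) . ":$freq:" .
        implode(",", $records) . "\n";
}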

If you have doubts about whether your code is outputting things as I want them, feel free to demo it to me before the homework is due.

Point Breakdown

Part 1 (each problem 1pt: 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong): 3pts
summaries.txt experiments: 2pts
Coding guidelines followed: 1pt
Program globs the .txt files in the directory provided on the command line; program reads in the texts from the given directory (1/2pt each): 1pt
Program creates an inverted index in memory (1pt for keeping count of num_docs and frequency for each term, 1pt for keeping track of the information needed for each record associated with an entry): 2pts
Program outputs the inverted index as described: 1pt
Total: 10pts